{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Basic indexing and searching with RAGatouille\n",
    "\n",
    "In this quick example, we'll use the `RAGPretrainedModel` magic class to demonstrate how to:\n",
    "\n",
    "- **Build an index from raw documents**\n",
    "- **Search an index for relevant documents**\n",
    "- **Load an index and the associated pretrained model to update or query it.**\n",
    "\n",
    "Please note: Indexing is currently not supported on Google Colab and Windows 10.\n",
    "\n",
    "First, let's load up a pre-trained ColBERT model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/bclavie/miniforge3/envs/ragatouille/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 24, 18:27:14] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/bclavie/miniforge3/envs/ragatouille/lib/python3.11/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.\n",
      "  warnings.warn(\n"
     ]
    }
   ],
   "source": [
    "from ragatouille import RAGPretrainedModel\n",
    "\n",
    "RAG = RAGPretrainedModel.from_pretrained(\"colbert-ir/colbertv2.0\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And that's all you need to do to load the model! All the config is now stored, and ready to be used for indexing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating an index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's index some documents now. We'll use data from Wikipedia, to build our Miyazaki-Index, which will store all you could ever know about Hayao Miyazaki('s wikipedia page).\n",
    "\n",
    "First, let's write a function to fetch the data from the Wikipedia with a clear user-agent, to be a good netizen:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "def get_wikipedia_page(title: str):\n",
    "    \"\"\"\n",
    "    Retrieve the full text content of a Wikipedia page.\n",
    "    \n",
    "    :param title: str - Title of the Wikipedia page.\n",
    "    :return: str - Full text content of the page as raw string.\n",
    "    \"\"\"\n",
    "    # Wikipedia API endpoint\n",
    "    URL = \"https://en.wikipedia.org/w/api.php\"\n",
    "\n",
    "    # Parameters for the API request\n",
    "    params = {\n",
    "        \"action\": \"query\",\n",
    "        \"format\": \"json\",\n",
    "        \"titles\": title,\n",
    "        \"prop\": \"extracts\",\n",
    "        \"explaintext\": True,\n",
    "    }\n",
    "\n",
    "    # Custom User-Agent header to comply with Wikipedia's best practices\n",
    "    headers = {\n",
    "        \"User-Agent\": \"RAGatouille_tutorial/0.0.1 (ben@clavie.eu)\"\n",
    "    }\n",
    "\n",
    "    response = requests.get(URL, params=params, headers=headers)\n",
    "    data = response.json()\n",
    "\n",
    "    # Extracting page content\n",
    "    page = next(iter(data['query']['pages'].values()))\n",
    "    return page['extract'] if 'extract' in page else None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now, let's use it to fetch the page's content and check how long it is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "45346"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "full_document = get_wikipedia_page(\"Hayao_Miyazaki\")\n",
    "len(full_document)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's a lot of characters! Thankfully, `RAGPretrainedColBERT.index()` also relies on a `CorpusProcessor`! It takes in various pre-processing functions and applies them to your documents before embedding and indexing them.\n",
    "\n",
    "By default, `CorpusProcessor` uses LlamaIndex's `SentenceSplitter`, with a chunk-size defined by your index's max document length. By default, `max_document_length` is 256 tokens, but you can set it to whatever you like.\n",
    "\n",
    "Let's keep our information units small and go for 180 when creating our index. We'll also add an optional document ID and an optional metadata entry for our index:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "[Jan 24, 18:27:15] #> Creating directory .ragatouille/colbert/indexes/Miyazaki \n",
      "\n",
      "\n",
      "[Jan 24, 18:27:20] [0] \t\t #> Encoding 81 passages..\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  0%|          | 0/2 [00:00<?, ?it/s]/Users/bclavie/miniforge3/envs/ragatouille/lib/python3.11/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling\n",
      "  warnings.warn(\n",
      " 50%|█████     | 1/2 [00:05<00:05,  5.99s/it]/Users/bclavie/miniforge3/envs/ragatouille/lib/python3.11/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling\n",
      "  warnings.warn(\n",
      "100%|██████████| 2/2 [00:07<00:00,  3.93s/it]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 24, 18:27:28] [0] \t\t avg_doclen_est = 129.74073791503906 \t len(local_sample) = 81\n",
      "[Jan 24, 18:27:28] [0] \t\t Creating 1,024 partitions.\n",
      "[Jan 24, 18:27:28] [0] \t\t *Estimated* 10,508 embeddings.\n",
      "[Jan 24, 18:27:28] [0] \t\t #> Saving the indexing plan to .ragatouille/colbert/indexes/Miyazaki/plan.json ..\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "WARNING clustering 9984 points to 1024 centroids: please provide at least 39936 training points\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Clustering 9984 points in 128D to 1024 clusters, redo 1 times, 20 iterations\n",
      "  Preprocessing in 0.00 s\n",
      "  Iteration 19 (0.40 s, search 0.39 s): objective=2093.81 imbalance=1.435 nsplit=0       \n",
      "[0.035, 0.04, 0.039, 0.035, 0.034, 0.037, 0.036, 0.033, 0.036, 0.039, 0.035, 0.038, 0.034, 0.039, 0.036, 0.037, 0.032, 0.036, 0.035, 0.038, 0.037, 0.037, 0.036, 0.037, 0.036, 0.033, 0.039, 0.033, 0.037, 0.036, 0.038, 0.039, 0.039, 0.034, 0.034, 0.034, 0.035, 0.038, 0.036, 0.04, 0.037, 0.037, 0.034, 0.033, 0.037, 0.033, 0.036, 0.038, 0.037, 0.036, 0.034, 0.035, 0.036, 0.036, 0.038, 0.037, 0.038, 0.036, 0.038, 0.035, 0.034, 0.036, 0.033, 0.034, 0.038, 0.034, 0.037, 0.038, 0.034, 0.034, 0.037, 0.036, 0.034, 0.038, 0.036, 0.034, 0.034, 0.037, 0.036, 0.035, 0.039, 0.038, 0.034, 0.039, 0.034, 0.036, 0.038, 0.037, 0.035, 0.039, 0.038, 0.036, 0.036, 0.036, 0.035, 0.037, 0.041, 0.034, 0.036, 0.039, 0.036, 0.042, 0.036, 0.036, 0.037, 0.035, 0.036, 0.034, 0.035, 0.034, 0.037, 0.036, 0.037, 0.033, 0.036, 0.038, 0.036, 0.036, 0.039, 0.038, 0.033, 0.033, 0.036, 0.038, 0.033, 0.038, 0.037, 0.035]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0it [00:00, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 24, 18:27:28] [0] \t\t #> Encoding 81 passages..\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 2/2 [00:07<00:00,  3.56s/it]\n",
      "1it [00:07,  7.19s/it]\n",
      "100%|██████████| 1/1 [00:00<00:00, 1145.67it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 24, 18:27:35] #> Optimizing IVF to store map from centroids to list of pids..\n",
      "[Jan 24, 18:27:35] #> Building the emb2pid mapping..\n",
      "[Jan 24, 18:27:35] len(emb2pid) = 10509\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "100%|██████████| 1024/1024 [00:00<00:00, 164583.36it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 24, 18:27:35] #> Saved optimized IVF to .ragatouille/colbert/indexes/Miyazaki/ivf.pid.pt\n",
      "Done indexing!\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'.ragatouille/colbert/indexes/Miyazaki'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "RAG.index(\n",
    "    collection=[full_document], \n",
    "    document_ids=['miyazaki'],\n",
    "    document_metadatas=[{\"entity\": \"person\", \"source\": \"wikipedia\"}],\n",
    "    index_name=\"Miyazaki\", \n",
    "    max_document_length=180, \n",
    "    split_documents=True\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And that's our index created! It's already compressed and save to disk, so you're ready to use it anywhere you want. By the way, the default behaviour of `index()` is to split documents, but if for any reason you'd like them to remain intact (if you've already preprocessed them, for example), you can set it to false to bypass it!\n",
    "\n",
    "Let's move on to querying our index now..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Retrieving Documents"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`RAGPretrainedModel` has just indexed our document, so the index is already loaded into it and ready to use! \n",
    "\n",
    "Searching is very simple and straightforward, let's say I have a single query:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading searcher for index Miyazaki for the first time... This may take a few seconds\n",
      "[Jan 24, 18:27:38] #> Loading codec...\n",
      "[Jan 24, 18:27:38] #> Loading IVF...\n",
      "[Jan 24, 18:27:38] Loading segmented_lookup_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/bclavie/miniforge3/envs/ragatouille/lib/python3.11/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 24, 18:27:38] #> Loading doclens...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 1/1 [00:00<00:00, 3758.34it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 24, 18:27:38] #> Loading codes and residuals...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "100%|██████████| 1/1 [00:00<00:00, 398.09it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 24, 18:27:38] Loading filter_pids_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 24, 18:27:38] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...\n",
      "Searcher loaded!\n",
      "\n",
      "#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==\n",
      "#> Input: . What animation studio did Miyazaki found?, \t\t True, \t\t None\n",
      "#> Output IDs: torch.Size([32]), tensor([  101,     1,  2054,  7284,  2996,  2106,  2771,  3148, 18637,  2179,\n",
      "         1029,   102,   103,   103,   103,   103,   103,   103,   103,   103,\n",
      "          103,   103,   103,   103,   103,   103,   103,   103,   103,   103,\n",
      "          103,   103])\n",
      "#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
      "        0, 0, 0, 0, 0, 0, 0, 0])\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/bclavie/miniforge3/envs/ragatouille/lib/python3.11/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[{'content': 'In April 1984, Miyazaki opened his own office in Suginami Ward, naming it Nibariki.\\n\\n\\n=== Studio Ghibli ===\\n\\n\\n==== Early films (1985–1996) ====\\nIn June 1985, Miyazaki, Takahata, Tokuma and Suzuki founded the animation production company Studio Ghibli, with funding from Tokuma Shoten. Studio Ghibli\\'s first film, Laputa: Castle in the Sky (1986), employed the same production crew of Nausicaä. Miyazaki\\'s designs for the film\\'s setting were inspired by Greek architecture and \"European urbanistic templates\".',\n",
       "  'score': 25.90448570251465,\n",
       "  'rank': 1,\n",
       "  'document_id': 'miyazaki',\n",
       "  'document_metadata': {'entity': 'person', 'source': 'wikipedia'}},\n",
       " {'content': 'Hayao Miyazaki (宮崎 駿 or 宮﨑 駿, Miyazaki Hayao, Japanese: [mijaꜜzaki hajao]; born January 5, 1941) is a Japanese animator, filmmaker, and manga artist. A co-founder of Studio Ghibli, he has attained international acclaim as a masterful storyteller and creator of Japanese animated feature films, and is widely regarded as one of the most accomplished filmmakers in the history of animation.\\nBorn in Tokyo City in the Empire of Japan, Miyazaki expressed interest in manga and animation from an early age, and he joined Toei Animation in 1963. During his early years at Toei Animation he worked as an in-between artist and later collaborated with director Isao Takahata.',\n",
       "  'score': 25.572620391845703,\n",
       "  'rank': 2,\n",
       "  'document_id': 'miyazaki',\n",
       "  'document_metadata': {'entity': 'person', 'source': 'wikipedia'}},\n",
       " {'content': 'Glen Keane said Miyazaki is a \"huge influence\" on Walt Disney Animation Studios and has been \"part of our heritage\" ever since The Rescuers Down Under (1990). The Disney Renaissance era was also prompted by competition with the development of Miyazaki\\'s films. Artists from Pixar and Aardman Studios signed a tribute stating, \"You\\'re our inspiration, Miyazaki-san!\"',\n",
       "  'score': 24.84041976928711,\n",
       "  'rank': 3,\n",
       "  'document_id': 'miyazaki',\n",
       "  'document_metadata': {'entity': 'person', 'source': 'wikipedia'}}]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "k = 3 # How many documents you want to retrieve, defaults to 10, we set it to 3 here for readability\n",
    "results = RAG.search(query=\"What animation studio did Miyazaki found?\", k=k)\n",
    "results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But is it efficient? Let's check how long it takes ColBERT to embed our query and retrieve documents. Because ColBERT's main retrieval approach relies on `maxsim`, a very efficient operation, searching through orders of magnitudes more documents shouldn't take much longer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "52.8 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "RAG.search(query=\"What animation studio did Miyazaki found?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also batch queries, which will run faster if you've got many different queries to run at once. The output format is the same as for a single query, except it's a list of lists, where item at index `i` will correspond to the query at index `i`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2it [00:00, 105.94it/s]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[[{'content': 'In April 1984, Miyazaki opened his own office in Suginami Ward, naming it Nibariki.\\n\\n\\n=== Studio Ghibli ===\\n\\n\\n==== Early films (1985–1996) ====\\nIn June 1985, Miyazaki, Takahata, Tokuma and Suzuki founded the animation production company Studio Ghibli, with funding from Tokuma Shoten. Studio Ghibli\\'s first film, Laputa: Castle in the Sky (1986), employed the same production crew of Nausicaä. Miyazaki\\'s designs for the film\\'s setting were inspired by Greek architecture and \"European urbanistic templates\".',\n",
       "   'score': 25.90448570251465,\n",
       "   'rank': 1,\n",
       "   'document_id': 'miyazaki',\n",
       "   'document_metadata': {'entity': 'person', 'source': 'wikipedia'}},\n",
       "  {'content': 'Hayao Miyazaki (宮崎 駿 or 宮﨑 駿, Miyazaki Hayao, Japanese: [mijaꜜzaki hajao]; born January 5, 1941) is a Japanese animator, filmmaker, and manga artist. A co-founder of Studio Ghibli, he has attained international acclaim as a masterful storyteller and creator of Japanese animated feature films, and is widely regarded as one of the most accomplished filmmakers in the history of animation.\\nBorn in Tokyo City in the Empire of Japan, Miyazaki expressed interest in manga and animation from an early age, and he joined Toei Animation in 1963. During his early years at Toei Animation he worked as an in-between artist and later collaborated with director Isao Takahata.',\n",
       "   'score': 25.572620391845703,\n",
       "   'rank': 2,\n",
       "   'document_id': 'miyazaki',\n",
       "   'document_metadata': {'entity': 'person', 'source': 'wikipedia'}},\n",
       "  {'content': 'Glen Keane said Miyazaki is a \"huge influence\" on Walt Disney Animation Studios and has been \"part of our heritage\" ever since The Rescuers Down Under (1990). The Disney Renaissance era was also prompted by competition with the development of Miyazaki\\'s films. Artists from Pixar and Aardman Studios signed a tribute stating, \"You\\'re our inspiration, Miyazaki-san!\"',\n",
       "   'score': 24.84041976928711,\n",
       "   'rank': 3,\n",
       "   'document_id': 'miyazaki',\n",
       "   'document_metadata': {'entity': 'person', 'source': 'wikipedia'}}],\n",
       " [{'content': \"== Early life ==\\nHayao Miyazaki was born on January 5, 1941, in Tokyo City, Empire of Japan, the second of four sons. His father, Katsuji Miyazaki (born 1915), was the director of Miyazaki Airplane, his brother's company, which manufactured rudders for fighter planes during World War II. The business allowed his family to remain affluent during Miyazaki's early life. Miyazaki's father enjoyed purchasing paintings and demonstrating them to guests, but otherwise had little known artistic understanding. He said that he was in the Imperial Japanese Army around 1940; after declaring to his commanding officer that he wished not to fight because of his wife and young child, he was discharged after a lecture about disloyalty.\",\n",
       "   'score': 24.939619064331055,\n",
       "   'rank': 1,\n",
       "   'document_id': 'miyazaki',\n",
       "   'document_metadata': {'entity': 'person', 'source': 'wikipedia'}},\n",
       "  {'content': \"Directed by Isao Takahata, with whom Miyazaki would continue to collaborate for the remainder of his career, the film was highly praised, and deemed a pivotal work in the evolution of animation. Miyazaki moved to a residence in Ōizumigakuenchō in April 1969, after the birth of his second son.Miyazaki provided key animation for The Wonderful World of Puss 'n Boots (1969), directed by Kimio Yabuki. He created a 12-chapter manga series as a promotional tie-in for the film; the series ran in the Sunday edition of Tokyo Shimbun from January to March 1969.\",\n",
       "   'score': 24.706043243408203,\n",
       "   'rank': 2,\n",
       "   'document_id': 'miyazaki',\n",
       "   'document_metadata': {'entity': 'person', 'source': 'wikipedia'}},\n",
       "  {'content': \"Specific works that have influenced Miyazaki include Animal Farm (1945), The Snow Queen (1957), and The King and the Mockingbird (1980); The Snow Queen is said to be the true catalyst for Miyazaki's filmography, influencing his training and work. When animating young children, Miyazaki often takes inspiration from his friends' children, as well as memories of his own childhood.\\n\\n\\n== Personal life ==\\nMiyazaki married fellow animator Akemi Ōta in October 1965; the two had met while colleagues at Toei Animation. The couple have two sons: Goro, born in January 1967, and Keisuke, born in April 1969. Miyazaki felt that becoming a father changed him, as he tried to produce work that would please his children.\",\n",
       "   'score': 24.441795349121094,\n",
       "   'rank': 3,\n",
       "   'document_id': 'miyazaki',\n",
       "   'document_metadata': {'entity': 'person', 'source': 'wikipedia'}}]]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_results = RAG.search(query=[\"What animation studio did Miyazaki found?\", \"Miyazaki son name\"], k=k)\n",
    "all_results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And that's it for the basics of querying an index! You're now ready to index and retrieve documents with RAGatouille!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using an already-created index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the examples above, we embedded documents into an index and queried it during the same session. But a key feature is **persistence**: indexing is the slowest part, we don't want to have to do this every-time!\n",
    "\n",
    "Loading an already-created Index is just as straightforward as creating one from scratch. First, we'll load up an instance of RAGPretrainedModel from the index, where the full configuration of the embedder is stored:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/bclavie/miniforge3/envs/ragatouille/lib/python3.11/site-packages/torch/cuda/amp/grad_scaler.py:125: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.\n",
      "  warnings.warn(\n"
     ]
    }
   ],
   "source": [
    "# This is the path to index. We recommend keeping this path format when using RAGatouille somewhere else.\n",
    "path_to_index = \".ragatouille/colbert/indexes/Miyazaki/\"\n",
    "RAG = RAGPretrainedModel.from_index(path_to_index)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And that's it! The index is now fully ready to be queried using `search()` as above.\n",
    "\n",
    "### Updating an index\n",
    "\n",
    "Once you've loaded an existing index, you might want to add new documents to it. RAGatouille supports this via the `RAGPretrainedModel.add_to_index()` function. Due to the way ColBERT stores documents as bags-of-embeddings, there are cases where recreating the index is more efficient than updating it -- you don't need to worry about it, the most efficient method is automatically used when you call `add_to_index()`.\n",
    "\n",
    "You want to expand, and cover more of Studio Ghibli, so let's get the Studio's page into our index too!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING: add_to_index support is currently experimental! add_to_index support will be more thorough in future versions\n",
      "[Jan 24, 18:27:50] #> Loading codec...\n",
      "[Jan 24, 18:27:50] #> Loading IVF...\n",
      "[Jan 24, 18:27:50] #> Loading doclens...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 1/1 [00:00<00:00, 3019.66it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 24, 18:27:50] #> Loading codes and residuals...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "100%|██████████| 1/1 [00:00<00:00, 282.33it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "New index_name received! Updating current index_name (Miyazaki) to Miyazaki\n",
      "\n",
      "\n",
      "[Jan 24, 18:27:51] #> Note: Output directory .ragatouille/colbert/indexes/Miyazaki already exists\n",
      "\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 24, 18:27:53] [0] \t\t #> Encoding 59 passages..\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 1/1 [00:05<00:00,  5.94s/it]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 24, 18:27:59] [0] \t\t avg_doclen_est = 124.83050537109375 \t len(local_sample) = 59\n",
      "[Jan 24, 18:27:59] [0] \t\t Creating 1,024 partitions.\n",
      "[Jan 24, 18:27:59] [0] \t\t *Estimated* 7,364 embeddings.\n",
      "[Jan 24, 18:27:59] [0] \t\t #> Saving the indexing plan to .ragatouille/colbert/indexes/Miyazaki/plan.json ..\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "WARNING clustering 6997 points to 1024 centroids: please provide at least 39936 training points\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Clustering 6997 points in 128D to 1024 clusters, redo 1 times, 20 iterations\n",
      "  Preprocessing in 0.00 s\n",
      "  Iteration 19 (0.22 s, search 0.21 s): objective=1269.87 imbalance=1.532 nsplit=0       \n",
      "[0.037, 0.037, 0.036, 0.035, 0.03, 0.037, 0.032, 0.031, 0.032, 0.035, 0.036, 0.037, 0.033, 0.032, 0.034, 0.038, 0.029, 0.032, 0.037, 0.033, 0.036, 0.035, 0.032, 0.035, 0.034, 0.035, 0.036, 0.029, 0.034, 0.032, 0.036, 0.039, 0.036, 0.03, 0.032, 0.036, 0.031, 0.036, 0.034, 0.038, 0.035, 0.037, 0.033, 0.031, 0.034, 0.031, 0.033, 0.034, 0.033, 0.034, 0.033, 0.035, 0.034, 0.032, 0.032, 0.034, 0.034, 0.037, 0.039, 0.034, 0.031, 0.034, 0.033, 0.033, 0.035, 0.033, 0.037, 0.035, 0.03, 0.036, 0.035, 0.032, 0.031, 0.035, 0.032, 0.033, 0.036, 0.035, 0.033, 0.035, 0.034, 0.034, 0.031, 0.039, 0.033, 0.037, 0.032, 0.036, 0.032, 0.041, 0.037, 0.034, 0.032, 0.036, 0.037, 0.036, 0.037, 0.031, 0.036, 0.033, 0.038, 0.039, 0.032, 0.032, 0.037, 0.031, 0.033, 0.031, 0.034, 0.031, 0.036, 0.035, 0.035, 0.031, 0.034, 0.034, 0.034, 0.035, 0.036, 0.036, 0.032, 0.031, 0.035, 0.033, 0.033, 0.034, 0.031, 0.036]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "0it [00:00, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 24, 18:27:59] [0] \t\t #> Encoding 59 passages..\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 1/1 [00:04<00:00,  4.85s/it]\n",
      "1it [00:04,  4.93s/it]\n",
      "100%|██████████| 1/1 [00:00<00:00, 766.36it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 24, 18:28:04] #> Optimizing IVF to store map from centroids to list of pids..\n",
      "[Jan 24, 18:28:04] #> Building the emb2pid mapping..\n",
      "[Jan 24, 18:28:04] len(emb2pid) = 7365\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "100%|██████████| 1024/1024 [00:00<00:00, 177214.36it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Jan 24, 18:28:04] #> Saved optimized IVF to .ragatouille/colbert/indexes/Miyazaki/ivf.pid.pt\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Done indexing!\n",
      "Successfully updated index with 59 new documents!\n",
      " New index size: 140\n"
     ]
    }
   ],
   "source": [
    "new_documents = get_wikipedia_page(\"Studio_Ghibli\")\n",
    "\n",
    "RAG.add_to_index([new_documents])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And again, that's it! The index has been updated with your new document set, and the updates are already persisted to disk. You're now ready to query it with `search()`!"
   ]
  }
 ],
 "metadata": {
  "environment": {
   "kernel": "python3",
   "name": ".m114",
   "type": "gcloud",
   "uri": "gcr.io/deeplearning-platform-release/:m114"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
